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Abstract 

This paper explores the application of methods from information geometry to the sequential Monte 
Carlo (SMC) sampler. In particular the Riemannian manifold Metropolis-adjusted Langevin algorithm 
(niMALA) is adapted for the transition kernels in SMC. Similar to its function in Markov chain Monte 
Carlo methods, the mMALA is a fully adaptable kernel which allows for efficient sampling of high- 
dimensional and highly correlated parameter spaces. We set up the theoretical framework for its use 
in SMC with a focus on the application to the problem of sequential Bayesian inference for dynamical 
systems as modelled by sets of ordinary differential equations. In addition, we argue that defining 
the sequence of distributions on geodesies optimises the effective sample sizes in the SMC run. We 
illustrate the application of the methodology by inferring the parameters of simulated Lotka-Volterra 
and Fitzhugh-Nagumo models. In particular we demonstrate that compared to employing a standard 
adaptive random walk kernel, the SMC sampler with an information geometric kernel design attains a 
higher level of statistical robustness in the inferred parameters of the dynamical systems. 

Keywords: Information geometry. Sequential Monte Carlo, Bayesian inference. Dynamical Systems, 
mMALA 



1. Introduction 

Sequential Monte Carlo (SMC) techniques are very much the sampler du jour for a wide range of 
tasks traditionally tackled using Markov chain Monte Carlo (MCMC) methods [2 [2]. In the context 
of sequential Bayesian inference, for example, these include posterior sampling and model selection. 
The distinguishing characteristic of SMC sampling algorithms, as the name suggests, is a sequence of 
intermediate distributions {7ri(a;i), 7r2(a;2), . . . }. In early applications of SMC methods (e.g. particle 
filters), such sequences would typically consist of non- homogeneous distributions defined on supports 
of different, often increasing, dimensions (i.e. dim(A'a) > dim(Af,) for a > b, where Xa G Xa). Also, 
the algorithm outputs are usually samples from every distribution in the sequence. In more recent 
incarnations of SMC algorithms, in particular those viewed as alternatives to MCMC methods, these 
intermediate distributions only play a collective supporting role of bridging the gap between a tractable 
initial distribution and a target distribution of interest; it is also often the case, for example in tempered 
sequences, that the entire sequence is defined on a single support. 

It is the presence of a common support which, together with the freedom of constructing the set of 
intermediate distributions, throws up an important design consideration for optimising the efficiency of 
the algorithm. The entire sequence, book-ended by the initial and target distributions, now defines a 
path in the space of distributions. As a proxy to the task of minimising the cumulative distances between 
the distributions in the chain, one seeks the 'shortest' continuous route between the initial and target 
distributions. This is not unlike the requirement in sequential importance sampling (SIS) for adjacent 
importance distributions to be 'similar' to each other. The overall aim of this paper is to explore two 
aspects of this design guidance. The first aspect is the identity of the sequence itself, while the second 
is the perturbation methods used to move along this prc-specified sequence of distributions. Apart 
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from several trivial examples where an analytical solution for the shortest path, or geodesic, exists (e.g. 
sampling from a multivariate normal distribution), consideration of this first aspect is usually restricted 
to the constrained problem of selecting the optimal spacing of distributions along a given, possibly 
arbitrarily chosen, path. For example, in a sequence of p tempered distributions 



where T = {1, 2, . . . and / some specified parametric function of x, selecting the optimal sequence 
simply involves tuning and selecting an optimal sequence of parameters {0a}aeT- 

The specification of geodesies, together with the concept of movement in distribution-space strongly 
suggests that adopting a geometrical angle may provide a unifying framework for the consideration of the 
above design issues. Indeed such a geometrical framework was first introduced to MCMC in [3] where 
the authors employed ideas from information geometry to the design of efficient MCMC kernels and 
mooted its extension to population Monte Carlo (PMC) methods. Other recent examples of statistical 
applications of Riemannian geometry are its use in increasing the efficiency of Variational Bayesian 
methods [4 and in sensitivity analysis in stochastic models 15,, amongst others. 

Far from being an independent approach, SMC, together with other PMC methods, is often described 
as parallel MCMC, in part because the sampled variables, or particles, are typically perturbed along the 
sequences of intermediate distributions using MCMC proposals and are accepted or rejected based on 
the standard evaluation of the Metropolis-Hastings (MH) ratio. Seen in this context it is not surprising 
for theoretical developments in MCMC algorithms, specifically in the design of efficient kernel proposals, 
to find their way into SMC methods. In fact, as it has been demonstrated in [3], the well-documented 
advantage of PMC over MCMC samplers in addressing the issue of multiple distribution modes is 
preserved when adopting the differential geometric kernels. This paper follows this trend where we 
extend to the SMC sampler of Del Moral et al [21 the work of Girolami and Calderhead [3] . 

The elegance of a geometrical framework is very often the sole justification for its construction. 
However, in the case of the SMC sampler, the benefits extend beyond mere aesthetics; geometrically 
optimal placements of distributions and perturbation kernels will result in improved particle acceptance 
rates with greater effective sample sizes, which in turn lead to more robust and, hence, reliable sampling 
statistics over the course of the algorithm. 

The rest of this paper is organised as follows. In section[2]we provide a brief review of the theoretical 
background to our work - namely SMC samplers and the relevant aspects of information geometry. In 
section [3] we consider geodesies on statistical manifolds and adapt the differential geometric MCMC 
kernels for use in SMC samplers. We illustrate the latter application to sequential Bayesian inference by 
way of three examples. We start with a trivial example involving the univariate Gaussian distribution 
and follow up with simulations involving the Fitzhugh-Nagumo and Lotka-Volterra ordinary differential 
equation (ODE) models. We conclude in chapter |4] with a summary and discussion. 

Note on notation. We have adopted the Einstein summation convention. Also, components of a metric 
g and its inverse are written with covariant {gij) and contravariant indices ((?*■') respectively, such 
that g^^gkj = with the Kronecker delta. Where there is no ambiguity, derivatives are abbreviated 
in the usual form as di = for a given variable Component-wise vector multiplication which maps 
M" X M" M" is specified by the * symbol, where (u * v)^ = UiVi (no sum), with Ui and Vi the ith 
components of vectors u and v respectively. 

2. Theoretical background 

2.1. Sequential Monte Carlo Samplers 

There are, at present, several different implementations of SMC algorithms (see, for instance, [HIT]). 
With regards to the ability to admit intermediate distributions defined on a common space - a necessary 
requirement of our attempt to build a geometrical framework - and the fiexibility to define general moves 
between populations, it is the SMC sampler implementation by Del Moral et al [2 which proves the most 
natural. We adopt this version in this paper and, in accordance with its authors, refer to it throughout 
simply as the SMC sampler. 
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Similar to SIS, the SMC sampler is used to sample successively from a sequence of distributions. 
Unlike SIS, however, the SMC sampler does not require one to calculate the intermediate importance 
distributions. This is achieved through the introduction of artificial backward-in-time Markov kernels 
La alongside the usual forward- in-time versions Ka, the effect of which is a reduction of the complexity 
from 0{N'^) to 0{N), where N is the number of sampled particles. 

Let TTp, p e Z, be a target distribution for sampling. One specifies a preceding sequence of distri- 
butions {tti, 7r2, . . . , TTp^i}, with all {TTajaeT defined on the support of iTp. Let {?7a}aeT represent the 
sequence of importance distributions where only the initial importance distribution 771 is explicitly spec- 
ified; the most natural choice is rji — tti. In practice, the normalisation constants are unknown and one 
works instead with the unnormalised sequence {7a}aeT, where 

T^a^^, ZaeR. (2) 

One samples a population of iV particles {Ci"''}ng{i.2,...,jv} from the importance distribution r\a by 
perturbing particles from the previous population via the transition kernels Ka.{^'' \ ^a-\) ~ ^-S- local 
random walks, Gibbs moves, etc. In this paper we use MCMC kernels for these local moves, the 
advantage being that the guaranteed convergence to each tTq allows for several simplifying approximations 
to be made in the algorithm (see [2]). In particular, one has a simple approximation of a near-optimal 
expression of in terms of Ka and tTq, which, in turn, simplifies the sequential updating of the particle 

in) 

weights - i.e. the incremental factor to update the weights of a particle ^^„\ is 



(3) 



The SMC sampler is summarised below (Algorithm [ij . For more details of the construction of the 
algorithm and proofs of consistency, we refer the reader to 0. 

Algorithm 1: The SMC sampler with MCMC kernels 
Input: No. of particles per population N ^ Sequence of distributions tTq, a = 1, . . . ,p. Effective 

sample size (ESS) threshold T. 
Output: Sampled particles ^p. Particle weights W^, i — 1, . . . ,N . 

1 Initialise a = 1, ESS = iV; 

2 Sample particles ^J"' from prior tti; 

3 Set weights w[''^ = 

4 for a < p do 
if ESS < T then 

Resample particles {^'^^1} from weighted multinomial distribution {{S.'^i, 



5 

6 
7 

8 
9 
10 

11 

12 
13 
14 
15 
16 



Reset weights = ^, Vi = 1, . . . , A^; 

end 

for n = 1,2, . . . , A^ do 

Draw ^i"^ - Kai- \ where Ka is a MCMC kernel; 

Evaluate incremental weight Wa (Ci"''i , Ca"^ ) — ; 

Update particle weight = W^_^ ■ ?I;a(d"\: d"^); 
end 



Normafise particle weights = W^/J2Z=i W^; 
Calculate ESS = 1^" h 

Set a = a+1; 

17 end 

18 Return {^^"\ W^p"}ie{i,...,Ar}- 



For the remainder of the paper, where there is no ambiguity, we drop the particle index and simply 



write ^i"-' ^a- 
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2.2. Information geometry 



The premise of all of information geometry is the observation that the space of probability distri- 
butions on a set X can be viewed as a Riemannian manifold with a unique metric g |8j. Clearly, this 
is an infinite-dimensional manifold. In most applications, however, one considers the projection to a 
finite-dimensional submanifold S by restricting to a specific statistical model of choice as represented 
by a set S of parameters. By expressing the set of coordinate functions on the manifold in terms of 
model parameters ^ G 2, it is easy to conceptualise the manifold structure of S. For example, the space 
of two-dimensional Gaussian distributions M{p,, S) is a five-dimensional manifold with (non-canonical) 
coordinate functions {/Ui, /X2j S12, S22}. 

The coordinate-specific metric gij{(,) on 5 = {p{x;(,) | ^ G S} is defined by the expected Fisher 
information matrix 

g,, iO := [d,£M] = / ^^e{x; Op{^; dx, (4) 

where i{x; ^) = logp(a;; ^), and is the expectation with respect to the probability distribution p{x; ^). 
Using the property / p{x; £,) dx = 1, one derives the useful alternative form of the Fisher metric 

9^J{0 = -Ed^^^A]■ (5) 

Apart from the invariance of gij under a sufficient statistic map, the foundations of information 
geometry contain few other constraints. Specifically there are no restrictions on the class of statistical 
models or on the identity of the measure space X, which can be discrete or continuous, thereby explaining 
its wide applicability across many different fields. For example, in systems biology contexts, we often 
have 

X = (R+f' X {R+f^ X • • • X (E+)'^^ (6) 

where each sample x € X represents a set of non-negative measurements (species abundance, chemical 
concentration, etc) of S separate biological entities, each measured at Tg separate time-points (s — 
1,...,5). 

With the metric gij, one proceeds to import the concepts and structures of Riemannian geometry 
like distances, geodesies, curvature, etc to a wide range of statistical analyses (see [8] for a concise 
introduction to the relevant aspects of Riemannian geometry). For example, we can define parallel 
transport on the space of distributions using the Levi-Civita connection V^.c^j — T^^jdk, where the 
Christoffel symbols F are expressed, in the usual way, in terms of the Fisher metric and its derivatives 
as 

T^^-lg^'id^g^i+d^gu-dig,,). (7) 
Using this expression we have the geodesies 7 : M — > 5 on 5 as solutions to the geodesic equation 

7'(i)+f(t)f W(r^),(t)=0, (8) 

where the dot refers to differentiation w.r.t. t £ R, and the indices indicate the ith component in the 
given coordinate parameterisation, i.e. 7*(t) etc. 

In information geometry, the infinitesimal distance between two distributions is directly related to 
the Kullback-Leibler (KL) divergence KL{- \ ■). With ds^ = g.jd^d^^ , we have [5] 

KL{p{x-^ + dO\p{x;0)^lds'- (9) 



3. Information geometric design of SMC samplers 

3.1. Geodesies on statistical manifolds 

In many implementations of the SMC sampler, the sampling of each intermediate distribution is 
optimised by minimising the KL-divergence from the previous distribution in the sequence. This is 
often represented symbolically as tTq « tTq+i, where a = 1,2, ... ,p. Together with the relation in (|9|, 
this suggests that, in the asymptotic limit p — >■ 00, the full SMC sampler is optimised by selecting the 
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intermediate distributions to lie on the geodesic connecting the fixed initial and final target distribution 
boundary points. We illustrate this by way of the following trivial example. 



Sampling from a univariate Gaussian distribution. Assuming that one has a process of sampling from 
the standard normal A/'(0, 1), one can use the SMC sampler to sample from an arbitrary distribution 
Af{n' , c'^). We consider sequences of 25 distributions on each of three separate paths on the univariate 
Gaussian manifolcj^ with one of the paths being the geodesic. The analytic solution to the geodesic 
equation Q with fixed boundary points is provided in Appendix A.l We compare the optimality 
of the SMC sampler by tracking the effective sample size as the SMC sampler progresses through the 
intermediate distributions. 

In the coordinate representation ct^), we select arbitrary initial and target distributions pi = (0, 1) 
and P25 — (5, 3) respectively. We employ a local random walk kernel with a fixed uniform proposal. Now 
in the SMC sampler algorithm (Algorithm [I]) , the expression for the incremental weights Wa is a good 
approximation for MCMC kernels only if the successive populations are close, i.e. iia-i ~ T^a (see [2] eq. 
31). However, given that the effect of altering the distances between distributions is what we are aiming 
to test, there may be instances where this assumption is no longer valid. In its place, therefore, we use 
the full expression (|2] eq. 26) and particle approximation 



7a (Ca) 

/^11^7a-l(C)i^a(^a 10 



7a (Ca) 



(n) 



(10) 



where the full analytic expression for the kernel density Ka{^a^ \ Ci"\) is given in Appendix A. 2 



Without resampling, we track the average progression of the mean effective sample sizes (ESS) across 
ten independent iterations of the SMC sampler. The results, together with the paths on the manifold, 
are shown in Figure [l] It is clear that selecting the distributions to lie on the geodesies leads to a slower 
decline of the ESS over the course of the SMC sampler. 





5 10 15 20 
Population sequence number 



25 



Figure 1: Left: Sample paths between initial distribution A^(0, 1) and the target distribution A/^(5,3), each with 25 
intermediate distributions (including end-distributions). Right: Evolution of ESS over 25 populations. The circular 
points (red) represent the distributions on the geodesic between the fixed boundaries; the triangles (green) indicate the 
naive straight-line path which implicitly (and erroneously) assumes a Euclidean metric; the diamonds (blue) indicate a 
two-staged path with the coordinate the first to be fully adjusted. 



In practice, one would simply transform the samples from the standard normal x — >■ xa' +^! . However, 
the availability of analytical solutions to the geodesic equation for Gaussian distributions presents an 
opportunity to employ the SMC sampler and to observe the effect of locating distributions on geodesies. 
Unfortunately, such analytical solutions are almost always unavailable. Indeed, as mentioned in the 
introduction, the procedure of writing down a tempered sequence ([I]) simply side-steps the issue by 
adopting the possibly non-optimal solution of fixing all but one of the parameters to the target distribu- 



^Note that no attempt was made, on either geodesic or non-geodesic paths, to ensure equal Fisher distance between 
all pairs J°^^ ds, as was done in [10] . Should we choose to space the distributions evenly, we fully expect the relative 
performance of the three paths, and our conclusions, to remain unchanged. 
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tion values. Effectively, this is equivalent to taking a one-dimensional submanifold, with parametrising 
coordinate 4>, in place of the true geodesic, and then attempting to atone for our sins by selecting a 
sequence of {0a}aGT which minimises the cumulative distances for that particular path. One can also 
adopt a brute force approach to reducing the distance between each intermediate distribution by simply 
selecting a large number of intermediate distributions. 

3.2. Differential geometric MCMC kernels for SMC transitions 

A non-optimal placement of intermediate distributions in SMC can be mitigated somewhat by em- 
ploying an efRcient kernel for transitions between distributions. In this section we describe one such 
method - the manifold Metropolis-adjusted Langevin algorithm (mMALA) [3]. mMALA is an algo- 
rithm which prescribes, using the local geometry of the statistical manifold, the discretised flow of a 
particle on a given space, whereby the sampled positions of the particle over time follow a specified 
distribution. 

The mMALA MCMC kernel combines a discretised Brownian motion with drift proposal with the 
standard Metropolis-Hasting acceptance step. Let ^ G M.^ be a random vector parametrising distribu- 
tions p{x;^), X Cz X. As described above, we treat ^ as the coordinate functions of a _D-dimensional 
manifold S. Define p{£,) to be a densitjj^ on S. In the special case where the target distribution is 
proportional to the likelihood of the set of S observed samples {xg} from X, i.e. p{S^) oc nf=iP(^s!Oj 
one can calculate the Fisher information matrix g and track the discretised Langevin diffusion on the 
Riemannian manifold {S,g). As shown in 3J The proposal density is the Gaussian 

gier+i Ur)^AA(/i(e.,e),eV), (11) 

with e the Langevin diffusion discretisation step size, r the discrete time index, and where the components 
of the deterministic drift are given by 

where £{^) = logp(^). 

The geometric design of efficient transition kernels for SMC is easily adapted from the MCMC 
context. Consider a sequence {pa{0}a€T, defined on a common space S. Where in MCMC a single 
metric gij{C) is defined from the MCMC invariant density, here we have a sequence of metrics {(5a)ij}aeT- 
Concomitantly, the set of proposal densities ( [Tl| ) is replaced by a sequence of proposal densities 

{qa{^a+l\U}=^f{^^a{^a,e),e'g-')} , (13) 



where fj,a{^a,^) is given by (12) with g replaced, in sequence, by the metrics in {ga}aeT- 

Assuming that Pa ~ Pa+i we expect a relatively high acceptance rate in the MH step. We can, there- 
fore, get away with performing relatively fewer MCMC iterations (perhaps even just one per distribution 
in the sequence). Although this reduces the overall computational cost of the algorithm, it is dependent 
on the quality of the mixing property of the kernel, the latter of which can be improved, albeit at the 
expense of a decrease in acceptance rate, by choosing a larger Langevin diffusion step size e. 

Apart from the choice of the initial distribution, there are two key algorithmic parameters to set: 
the discretisation step size, e, and the number of populations, p. We demonstrate the application of the 
mMALA — SMC combination and observe the implications of various parameter combination choices by 
way of the following simple example. 

3.3. Example: Univariate Gaussian parameter inference 

Consider S observations, Xi, drawn from a univariate normal distribution A/'(/i, ct^) with parameters /i 
and cr to be inferred using the SMC sampler. Adopting ^ — (^1,^2) = (/^7 cr) as the coordinate functions 



^Note that this density is an additional structure on the manifold; p(5) is defined on S, whereas for s S 5, p(x; £,{s)) is 
defined on X. 
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of a manifold S, we assume the normal priors 7ri(^i) = N{ui,v\) and 7r2(C2) = ■Af{u2,V2) for some 
ui,vi,U2,V2 € We define a sequence of p tempered distributions [11] {paiO}aeT, T = {l,...,p}, 
where 



s 



2^1 



(14) 



with 



< (/)2 < • • • < <^p-l < 



1, and 7r(^) = diag{TTi{^i),'K2{^2))- It can be shown that for 



tempered distributions of the exponential form (e.g. (14)), the sum of symmetrised Kullback-Leibler 
divergences between the distributions is minimised by adopting a geometric tempering sequence where 
0a+i/0a = const |12j . Therefore, in the examples that follow, we adopt the geometric sequence with 



= 02 



and (pi — 0. 



(15) 



In addition, in this example, we fix 02 = 5 x 10 ^. 

Using ([5]) the metric for each distribution in the sequence is given by 




1 



(16) 



For step-size e, the mMALA drift and the covariance matrix of the diffusion term on S are, respectively. 



e 
2" 



C2 



. $2 -"2 



S01 
?2 



2S4>, 

C2 



V C2 2Ci y 



(17) 



C2 



(18) 



with the Ci = and C2 = + 2w|S'0£i. An important role played by the prior scale parameters 

vi and V2 is that of setting a lower bound on the components of the metric ga ■ The implications of this 
effect are examined in Section [3.41 



Parameter inference. We sample 60 points from the distribution J\f{^',a''^) = J\f{50, 10^) and let the 
prior distribution be centred on (/i',cr'), i.e. ui = 50 and U2 = 10, with vi = 20,^2 — 2.5. Selecting 
a sequence of 45 populations and using 1500 particles, we run the SMC sampler with e = 0.4 and 
resampling fraction T = 0.3. The estimates of the joint and marginal posteriors of fx, a are presented in 
Figure [2j As expected, the SMC sampler with mMALA transitions lead to a posterior which is centred 
around {fi, a) = (50, 10). 



mMALA drift. In order to visualise the path of the particles and to understand the effects of varying the 
parameters p and e on the effectiveness of the SMC sampler, we focus on the deterministic perturbation 
at each intermediate stage by setting the diffusion term to zero and accepting all particles without 
resampling. We select 24 points on the manifold S from a grid surrounding the sample and observe the 
drift ( 12 ) for varying numbers of intermediate populations p and step-sizes e. The plots of the paths in S 
are given in Figure|3] We can already deduce several features of the mMALA drift. For a fixed step-size, 
e, a minimum acceptable number of populations for the SMC run with Pmm ^ 45. If p < 45, the drift 
is too weak and the kernels do not mix well; if p > 45, the particles do indeed drift toward the target 
coordinates with the variance in particle end-points decreasing as p increases. If e is too small, the drift 
is too weak, resulting in a poor performance of the SMC sampler; if e is too large, the sharp changes 
in drift direction is an indication of the expected breakdown in the first-order Euler discretisation of 
the Langevin diffusion. The implication of this last point for SMC samplers is that a higher proportion 
of particles will be rejected in the MH-step leading to a lower number of effective particles used in the 
sampler. These results support the intuitive expectation that there is not just a minimum step size and 
total number of perturbation steps for effective sampling, but also a negative impact on the efficiency of 
the sampler when these tuning parameters are too large. 
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Figure 2: Left/middle: Weighted histogram of particles representing the estimated marginal distributions of the mean 
and standard deviation parameters. Right: Scatterplot of particles over 45 populations for a simulated univariate model 
A^{50, 10^). The initial particles are sampled from the prior n{^) = poiO ~ diag(J\f{50, 20^), A^(10, 2.5^)) and are coloured 
black in the scatter plot. The particles of the intermediate populations are represented in increasingly brighter shades of 
green with particles of the final population, representing the posterior, coloured red. 
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Figure 3: mMALA particle drift paths for the univariate normal model for different numbers of intermediate populations 
p (top row), and step sizes e (bottom row). The three population sizes and step sizes simulated are p = 10,45,500, and 
e = 0.1,0.4,0.7 respectively. The purple triangle marks the simulated mean and standard deviation {fi',a') = (50,10); 
the blue points represent the initial 24 particle samples and the red points the perturbed particles at the final population. 



The results from this simple example hints at a strategy for optimising the differential geometric 
kernels of the SMC sampler: keeping constant the product of the square of the step-size and population 
number, i.e. 

e^p ^ const, (19) 

we seek to minimise the number of populations whilst ensuring that the step-size is not so large that 
the precision of the algorithm is impacted by an increased rejection rate. For example, we found that 
the drift behaviour with e = 0.4, p = 45 is essentially identical to the set-up with e = 0.2, and p = 180 
(0.4^ X 45 = 0.2^ X 180). Without factoring in the particle acceptance rate, the efficiency of the algorithm 
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simply scales with the number of populations. 

Other than the tuneable parameters, p and e, it is the identity of the starting distribution, e.g. the 
prior in the context of Sequential Bayesian inference, which has a big impact on the viability of the 
mMALA transitions for SMC samplers. We defer this discussion to the following section where the 
implications are particularly significant. 



3.4- Application to parameter inference in dynamical systems 
Following [31 , we consider a system of differential equations 

i(<) = /(x,C,i), (20) 

where x is a D-dimensional vector, ^ the model parameter vector, and t the time variable. The observa- 
tions y of X are made of the underlying state at r time-points {ti, t2, ■ ■ ■ ,tr} with a known observation 
noise model, i.e. 

Y = X + E, (21) 



where X — X{xq,^) = (x(<i)|x(t2)| ■ ■ ■ I'^itr))'^ is a solution to (201, and E the observation noise. Y, 
X, and E are (r x £))-dimensional matrices. For given initial conditions xq and prior 7r(^), the task is 
to estimate the posterior 

p(C I Y, xo, E) cx 71(0 • L{Y 1 1 xo, E), (22) 

where L(- \ •) is the likelihood function. For example, as described in for a noise model given by a 
time- independent normal distribution with variance cr^, d G {1, . . . , D}, we have 

D 

L{Y I xo, E) = l[N{X{^, xo).,d, Sd) , (23) 

d 

where — irO'di ^^'^ -^ .d denotes the time-series vector for species d. For a log- normal noise model 
with corresponding parameters, we have 

D 

L{Y\^,x„,E)^l[logM{X{^,^Q).^d,^d). (24) 



Observation noise in realistic dynamical systems are, however, likely to be more complicated. The 
only consequence of deviating from the above vanilla noise models is an added complexity in the Fisher 
metric calculations. As an example of the additional algebra involved, we have examined the implications 
of two such modifications - heteroscedasticity and a truncation of the normal noise model. The details 
of these Fisher metric calculations are collected in [Appendix C.l| Nevertheless, since there are no 
conceptual novelties arising from cases with such adjustments, we have simply adopted the normal 



model of ( 23 1 for the remainder of this section. 



The problem can now be given a geometrical construction. We treat the model parameters f as coor- 



dinate functions on a manifold S. The combination of the differential equations (20) and an observation 
noise model allows us to associate to each point in 5 a probability distribution as follows. Define the 
measure space 

A" (M^ X X • •• X M^). (25) 

D terms 

The density p{y.,n]0 given by the posterior in ( [22| , i.e. 

p(K,„; - «7r(0 • L(K,„ | C, xq, E), (26) 

where k G M is a constant. Using this density we can then proceed to define the Fisher metric gij{£,) on 
S via ([5|. For both the Gaussian and log- normal noise model, the likelihood has an exponential form. 
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Hence, abbreviating p{x;(^) = K7r(^) exp <I>(^), we have 



-9,9,(logp(K,„;0) =-a^9,<& 



AO 



7r2(C) 



(27) 



where hij{^) depends only on ^ via the prior 7r(^) and not on Y. We evaluate hij{^) for several typical 
priors - uniform, multivariate normal (MVN), and the component-wise (CW) log-normal - a^ 



Uniform: 
MVN: 



CW log- normal: 



%(e) = o, 

(5, 



(1 - log r + /i' 



(28) 
(29) 

(30) 



where the /i^ and S^j are usual components of the mean vector and covariance matrix in dim(5)- 
dimensions; we also assume that for the uniform prior, hij{^) is evaluated away from the boundary of 
the non-zero support where diTT{^) is undefined. 



In the framework of SMC we now define, on the shared measure space (25), the sequence of p 
distributions 

Pa(x;0 = «7r(0- [L(r|e,xo,i?)]^", (31) 

with a e T and = < (j)2 < ■ ■ ■ < (pp — 1- From this sequence of distributions we have a matching 
sequence of Riemannian manifolds {{Sa, ga)}aeT- Similar to the evaluation in [3 , the Fisher metrics 
{ga)ij and their first and second derivatives w.r.t. ^, evaluated using ([s]), can be written for the normal 
noise model ( 23 ) as 



D 



d=l 
D 



dkiga) 



where the sensitivity S is defined as 



Si.d 



dXrf 
d^* ■ 



(32) 
(33) 

(34) 



Details of the derivation of (32 1 are given in Appendix C.l Following we evaluate S and its 
partial derivatives for all sampled time points via the numerical solutions to a set of auxiliary differential 



equations obtained by repeatedly differentiating (20) w.r.t. ^. These auxiliary equations are collected 
in [Appendix 63] 

We now explore the application of this methodology with two examples: the Fitzhugh-Nagumo and 
the Lotka-Volterra model. 



Example: Fitzhugh-Nagumo model. The Fitzhugh-Nagumo model was developed as a simplification of 
the Hodgkin-Huxley model, the latter describing the dynamics of the potentials of a spiking neuron. 
The dynamics of the voltage V and response R variables is modelled by the following set of ODEs 



dV 
~dt 
dR 



civ 



a~V~bR 



Rl 



(35) 
(36) 



Note: the indices are not summed over in the expression for the log-normal prior. 
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where a,b,c are the parameters of the system. Following the treatment in [5] and [13], we simulate 
the system with starting values (Vb,i?o) = (— Ij 1) and parameters {a',b',c') = (0.2,0.2,3). Assuming a 
fixed, identical (over all time-points), normal observation noise model for both potentials with = 0.05, 
we make 25 observations and attempt to infer the parameter values using an SMC sampler with mMALA 
transitions. 

Using N = 1000 particles, a discretisation step-size e — 0.6, and a resampling threshold of 0.3, we 



perform the algorithm over 50 tempered distributions (31) in a geometric sequence (15). We adopt 
component-wise normal priors for the three parameters. The prior is centred on the simulated mode, 
i.e. (/ia, A^b, A^c) = (0.2,0.2,3), and we set the covariance matrix to be E = diag {0.3^, 0.3'^, 1.5^). In 
addition we impose a positivity constraint on the parameters. The derivatives of V and R, required 
for the calculation of the Fisher metric, are given in [Appendix B.2 We present the scatter plots 



and weighted histograms representing the full and marginal inferred posterior in Figure |4] and plot the 
resulting expected ODE solutions in Figure [3) The results verifies the applicability of the SMC sampler 
to dynamical systems parameter inference. 





c 



2.75 2.81 



2.94 3.01 



Figure 4: 2D scatterplots and marginal distributions of the three parameters a, b, c of the Fitzhugh-Nagumo model over 
50 populations. The initial particles, coloured black in the scatter plots, are sampled from a component-wise normal prior 
with a positive truncation. Particles of the intermediate populations are represented in increasingly brighter shades of 
green with particles of the final population, representing the posterior, coloured red. 



We turn now to the issue of the identity of the prior distribution 
have already alluded to at various points in our discussion (c.f. (16) and Section 3.3). 



the importance of which we 
Similar to 



the simulations of the static Gaussian example in the previous section, we select an arbitrary grid 
of points about the simulation mode and focus on the deterministic drift by ignoring the diffusion 
steps and accepting all particles. We compare the behaviour arising from the component-wise normal 
prior given above and with that from a component-wise uniform prior with lower and upper limits 
{ai,bi,ci) = (0,0,0) and {ay_,hy_, c^i) = (1,1,7) respectively. The plots of the particle drifts are given 
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Time Time 

Figure 5: Voltage (V) and Response (R) observations (dots) and inferred time-series curves (lines). The expected time- 
series curve is obtained by taking the weighted mean of the inferred curves across all particles from the final population. 

in Figure [6j It is clear that, unlike the case with the normal prior, the SMC sampler with a uniform 
prior does not lead to an orderly drift of particles toward the simulated mode, which results in a highly 
inefficient sampler where the particle acceptance rate at each MH-step is extremely low (< 1%). This 
occurrence is unfortunate, if not entirely unexpected for the following reason. The procedure of utilising 



tempered distributions implies that the components of the Fisher metric ( 32 ) corresponding to the initial 
distributions with low values of the tempering parameter c/fa are bounded from below by a function of 
the prior parameters, i.e. for small (jja, 

{9ah ~ (37) 

For a uniform prior, hij — and hence {ga)ij ~ for all i, j. Given that both the drift and diffusion terms 
((11), ( |12[)) are proportional to the inverse metric g~^, the observed high-temperature drift behavioui]^ 



in Figure is fully consistent with the theoretical expectations. 

In our simulations, it turns out that the SMC sampler with mMALA transitions is not much more 
robust than one with a non-differential geometric global adaptive kernel, the latter with the computa- 
tional benefit of not requiring costly computations of the Fisher metric. We speculate on the possible 
reasons in the next section. However, we now consider another example where the performance of the 
information geometric SMC sampler shows a clear advantage over a standard adaptive kernel. 

Example: Lotka- Volterra Model. The Lotka-Volterra model is a simple representation of the predator- 
prey relationship in a closed environment. Let the variables x, y represent the numbers of prey and 
predator species respectively. The dynamics are represented by the set of ODEs 

^=x(«-/3y), (38) 

'^^=-y[l-5x), (39) 

where the four parameters (a,/3,7,(5) represent the prey birth rate, prey death rate due to predation, 
the predator death rate, and the predation efficiency respectively. 

We set up a simulation with starting population (a;o,yo) — (15,30) and parameters (a', /?', 7', (5') = 
(8,0.5,0.2,0.01). We adopt the same assumptions as for the Fitzhugh-Nagumo model simulation but 
with the normal noise cr^ = 0.4, and prior parameters centred on the simulated modes with S — 
rfia5(2^, 0.1^, 0.05^, 0.004^). We let the number of particles N = 1000, the discretisation step-size e = 0.5, 



and a resampling threshold of 0.3. We perform the algorithm over 30 tempered distributions (31) in a 
geometric sequence (15). In Figure [Tj we show the scatter plots and marginal weighted histograms of 



the inferred posterior. It is clear from the scatter plots that the parameters are very highly correlated 



^ , where T is the temperature. 
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Figure 6: mMALA particle drift paths for the Fitzhugh-Nagumo model. Each of the three rows represent a projection to 
a separate two-dimensional parameter subspace; the left and right columns are simulations with a uniform and a normal 
prior respectively. The lack of a lower bound on the Fisher metric in the simulation with the uniform prior (see text) 
results in a drift behaviour consistent with an extremely high temperature regime, rendering the SMC sampler algorithm 
unworkable. 

precisely the regime where the information geometric approach is expected to confer the greatest benefit. 
We verify the accuracy of the sampler by simulating the curves with the particle parameters, as shown 
in Figure [8j 

To demonstrate the efficiency of the differential geometric SMC kernel we benchmark its performance 
with a non-differential geometric, but nevertheless adaptive, kernel. We consider an MCMC kernel with 
an MVN proposal K'{^a+i \ S,a) with zero mean and a covariance matrix set to the sample covariance 
matrix of the particles in population a multiplied by the asymptotic factor (2.38)^/D [T3]. Here D is 
the dimension of the parameter space (i.e. D — A). 

The one measure of efficiency of practical importance we would like to examine is robustness; in the 
context of parameter inference this means a low variability in the inferred statistics over repeated runs of 
the SMC sampler. To that end we run, for both the mMALA and the above benchmark kernel, the SMC 
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Figure 7: 2D scattorplots and marginal distributions of the four parameters a, /?, 7, S of the Lotka-Volterra model over 
30 populations. The initial particles are sampled from a component-wise uniform prior with limits as shown in the 
figures. These particles are coloured black in the scatter plot. Particles of the intermediate populations are represented in 
increasingly brighter shades of green with particles of the final population, representing the posterior, coloured red. 

sampler for a range of total intermediate populations, performing 27 repetitions of each run. Over each 
set of 27 runs we determine the sample variance of the inferred parameter means. For both kernels, one 
would expect the sample variance to decrease with increasing total number population. This is, indeed, 
what we observe, and the results for parameter a are given in Figure [9j We see that the mMALA kernel 
outperforms the non-differential geometric adaptive kernel by demonstrating a consistently high level 
of robustness over different number of intermediate distributions. By employing the mMALA kernel, 
we have effectively shortened the distance between successive distributions in the SMC chain, thereby 
replicating the algorithm with a greater number of intermediate distributions. 

An alternative, explanatory, view of the efficiency of the mMALA kernel can be seen by tracking the 
effective sample sizes and particle acceptance rates of the SMC sampler over the intermediate populations. 
This is shown in Figure [TO] The relatively gradual decline in the ESS of the SMC populations when the 
mMALA kernel is employed is similar to the observed behaviour ESS along the geodesic in Figure [T] 

4. Summary and Discussion 

In this paper we have explored the application of methods and ideas from information geometry 
to SMC sampling and have demonstrated its use in sequential Bayesian inference. We focused on two 
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Figure 8: Predator and prey observations (dots) and inferred time-series curves (lines). The expected time-series curves 
are obtained by taking the weighted mean of the inferred curves across all particles. 
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Figure 9: The variance in the estimate of the a parameter of the Lotka-Volterra model as given by SMC performed using 
mMALA proposals (red •) and non-differential geometric global adaptive kernel (green The profile of the estimates 
of the other three parameters (/3, 7, 5) are similar. 



areas - the construction of the sequence of distributions which lie on geodesies, and the employment of 
mMALA as MCMC transition kernels between the intermediate distributions. 

Of the two areas, the theoretical foundations of the mMALA is more established and we have shown 
that the extension from MCMC to SMC is relatively straightforward. In particular the theoretical 
differences are the replacement of the single Fisher metric with a sequence of Fisher metrics defined by 
the sequence of distributions, and the parallelisation of the mMALA diffusion over multiple particles. 
The conceptual similarity with its use in MCMC allows all the advantages of the original mMALA 
formulation in Girolami and Calderhead 3 - namely the self-tuning characteristics, efficiency in highly- 
correlated and high-dimensional settings ~ to be carried across to SMC. 

The issue of intermediate distributions on geodesies is, in comparison, more difficult to tackle, simply 
due to the lack of analytical solutions for most non-trivial examples. Nevertheless we have demonstrated 
the potential advantages, in the form of a slower decline in the effective sample sizes, of placing our 
SMC intermediate distributions on geodesies by way of a simple univariate normal inference example. 
The computational price of evaluating these geodesies (where available) is a one-off cost incurred at the 
start of the SMC run and is independent of the number of sampled particles. 

At present the computational overhead involved in determining the components of the Fisher metric 
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Figure 10: The Metropolis-Hastings acceptance rates (left) and effective sample sizes (right) over a representative single 
run of the SMC sampler employing the mMALA kernel (red •) and the non-differential geometric adaptive random walk 
kernel (green ♦). The horizontal dotted line indicates the 30% resampling threshold. Employing the mMALA kernel results 
in a higher MH-acceptance rate and a correspondingly more gradual decline in the ESS. 



is seemingly a disadvantage. In the application to the analysis of dynamical systems, one is required 
to solve at least two separate sets of differential equations for the sensitivities and their derivatives; 
for heteroscedastic observation noise models, the burden is compounded by the need to solve for the 
double derivative of the metric. Because the computational cost of evaluating the Fisher metric scales 
as 0{D^), with D the dimension of the statistical manifold, the efficiency of the technique is reduced 
in precisely the high-dimensional regime that a non-information geometric kernel might be less suited. 
For more complicated noise models (e.g. heteroscedasticity), the computational complexity scales as 



0(-D ) - this is a consequence of the need to calculate the double derivatives didjSk (see Appendix 



|C.1| ). The computational burden can be reduced by adopting the simplified mMALA, where only the 
first-order sensitivities are calculated [31 E] ■ However, as shown in the original MCMC context [3j the 
effective computational time of the simplified mMALA, expressed in units of the ESS, is not significantly 
better than its full counterpart. Nevertheless, even in low-dimensional problems, especially in situations 
where the parameters are highly correlated, as we observed in the Lotka-Volterra model, the advantages 
conferred by the information geometric approach are clearly evident. Furthermore, it has been suggestet^ 
that the calculation of the Fisher metric can be made more efficient via a canonical transformation of 
the coordinates. This is an interesting avenue for further research. 

When employing the SMC sampler in sequential Bayesian inference, the form of the prior is often 
chosen primarily on the basis of knowledge of the parameter constraints, for example a gamma distribu- 
tion for strictly positive parameters, or a uniform prior when knowledge of the upper and lower bounds 
is available. Although such bounded priors enforce weak identifiability on the parameter estimation, 
as was highlighted in [6] and [15], the prior now has the additional role of regularising the Fisher met- 
ric. Indeed, as we have shown via the Fitzhugh-Nagumo model example, choosing a uniform prior in 
conjunction with a tempered sequence of distributions severely impacts the viability of the mMALA 
kernels; flat or nearly-flat intermediate distributions, as is the case with the initial distributions, results 
in high-temperature diffusion processes. Because the mMALA diffusion process is informed by the local 
geometry of the statistical manifold, it does not respect the global density boundaries and attempts to 
scatter the particles to fill the entire untruncated support, resulting in very low MH-acceptance rates. 

Although the transition kernels adapt to the local geometry of the manifold, there is still the out- 



^By A. C. C. Coolen in the Discussion section of [3]. 
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standing issue of tuning the discretisation step-size parameter. We have shown in the univariate model 
example that there is a minimum step-size, below which the particles do not travel far enough over the 
course of the SMC run. One possibility for improvement might be to adopt a higher-order discretisation 
approximation of the Langevin diffusion SDE which will allow us to select larger step size e without 
deviating too significantly from the Langevin diffusion path, which, in turn, leads to the maintenance 
of a high MH-acceptance rate and a reduction in the number of intermediate distributions needed for 
a given level of robustness. This can be achieved by replacing the first-order Euler discretisation with 
the higher-order Ozaki discretisation jl6j . We provide a brief description of the Ozaki discretisation in 
[Appendix C.2| At present it is not clear if the advantages are negated by the added complexity and 
the resulting increased computational burden, or if the performance of the simplified mMALA relative 
to the full version can be improved by altering the discretisation scheme. 

In summary, we see that the methods of information geometry can be applied to SMC samplers 
allowing for adaptive and efficient transition kernels. The information theoretic formulation provides a 
neat and aesthetically pleasing geometrical framework for future improvements in the algorithm. How- 
ever we have shown that niMALA kernels cannot substitute for careful monitoring of convergence, and 
potentially adjusting kernels appropriately. This is particularly challenging in areas where likelihoods 
are flat, which includes many dynamical systems [T7 | ITS l [T9]. 



Appendix A. Derivations for the geodesic example 

Appendix A.l. Geodesies on the Gaussian distribution manifold 

Consider a multivariate normal distribution (^, S) defining a manifold S. Following the analysis in 
[20l [2TI [22] , the metric on this space can be written using ([s]) as 

ds^ = {dnfj:-^dn + ^ ir(E-ME)2. (A.l) 

In these coordinates, the geodesic equations (|8| are written as 

S-H/i^^-i;S~iS = 0, (A.2) 

il-tT.-^fi = 0. (A.3) 

The strategy is to first solve for geodesies through the origin (/i(0), I](0)) = (0,lp) before translating to 
the curve with the desired end-points. Adopting the canonical coordinates {A,S) = (E^^,S^^^), (A.2) 
and (A.3 1 can be partially integrated to give 

A = -BA + xS'^, (A.4) 
6^-B5+{l + d^A-^5)x, (A.5) 

where B — A(0) and x = <5(0), and A{t) and 6{t) are the geodesic coordinates parametrised by i G M. 
It can be shown [55] that the solutions to the geodesic equations are 



A(i) = Ip + i [cosh(tG) - Ip] + ^B [cosh(tG) - Ip] {G-'^fB 
- ^smh{tG)G-^B - ^B smh{tG)G-\ 



(A.6) 



5{t) = -B[cosh{tG) - Ip] {G-'^y^x + smh{tG){G~^x), (A.7) 
where G^ :— B^ + 2xx^ . Simplifying to one dimension [p ~ 1), we have 

A{t) = 1 + i(l + i?2)(cosh(tG) - 1) - i?sinh(iG), (A.8) 

5{t) = (^—7^) ' (-i?(cosh(iG) - 1) + sinh(tG)), (A.9) 
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where R = B/G. This solution would suffice, except that it is given in terms of the gradient terms B, G 
instead of the target end-point coordinates; it is however not difficult to rewrite the solution. Keeping 
[p = 1), we set, without loss of generality, the target end-point of the geodesic to be located at i = 1. 



Let A' = A(l) and 5' = (5(1) we solve (A.8I and (A.9I for R and G giving 



G — cosh 



_ 2A'2 + 2A 
(,5'4 + 45'2 A'2 + 4,5'2 A' + 4A'4 - 8A'3 + 4A'2) s 
<5'4 + 45'2A'2 + 4<5'2A' 



8A'3 



4A'4+4A'2n 



(A.IO) 



(A.ll) 



Substituting these expressions for R and G back into (A. 8 1 and (A.9) gives us the solution for geodesies 
through the origin and a specified point coordinate at t = 1. 

Now in order to translate geodesies to geodesies, we need a map g : S ^ S which leaves the Fisher 
metric invariant. As shown in |22| the map which achieves this is the symmetric group GA'^{p)/SO{p) 
where the positive affine group is defined as 



GA+{p) := {g^{d,P) G x GL(p, M)| det P > 0}, 
and the group action on points in S, in terms of coordinates (/i, E) and (A, 6), are 



(A.12) 



S ^ S 

{A,6) ^ {{p-^fAp-\{p-Y^P''^^'^S- 



(A.13) 



APA-^S 



Ad), 



and with inverse element g ^ — {~P ^d,P ^). 



Given two points pi,P2 G S, it is then straightforward to solve for the desired geodesic. First, obtain 



the group element g' € GA'^{p)/SO{p) which maps the origin to pi. Substituting {A', 6') 



into (A. 8 1 and (A.9 1 via (A.IO) and (A.ll), we obtain a solution through the origin which can then be 



translated using g' to describe the desired geodesic. 



Appendix A. 2. Kernel density for a uniform proposal and Gaussian target 



The kernel density in ( 10 ) is the Metropolis-Hastings algorithm acceptance probability ^23j for a uni- 
form proposal of width d and a univariate Gaussian target. Following the MH-algorithm, the expression 
is calculated separately for the cases where the proposal is accepted or rejected. We have 



1 



max 



0, 



;„_i-^+a/2 



l,exp(-(%^ 



(forCa = ea-i) (A-14) 



(for L ^^a-l) 



with (/i, a) the mean and standard deviation parameters of the ath intermediate distribution. is the 
particle coordinate of population a and $ is the cumulative distribution function of the standard normal. 



Appendix B. Non-linear ODE calculations 

Appendix B.l. Auxiliary differential equations for the sensitivities 5^ „ 

The subscript conventions used in this section indicates the variable in the differential. We have 
/c} ^ ^ and {Z,TO,n} ^ X, for example djdif = Qcjgx' ■ ^-^oid cluttering the notation, when 



18 



there is no ambiguity, we drop the X-index on i.e. Si^i — ^ Si and /; — >■ /. 

^ {dif)Su + dj, (B.l) 

dkS, = {drndif)Sk^„,S,,i + {dkdif)S,,i + {dif){dkS,,i) + {mf)Sk,i + d^dkf, (B.2) 

djdkS^ = {d,ndldjf)Sk,mStJ + {dmdlf){djSk^m)SiJ + {dmdlf){djSi^l)Sk,m 

+ {didjdJ)Sk,i + {mf){d,Skj) + d.djdkf (B.3) 

+ {mdr.f)Sk,i + d,dkd,j]. 
Appendix B.2. Partial derivatives for the ODE examples 

Fitzhugh-Nagumo model. We reproduce here the expressions in [3]. The two components are labelled 
(V, R) with parameters (a, 6, c). The non-zero single derivatives are 



r.2 



dV _^ V'^ ^ OR 1 OR _ R OR _V - a + bR 
dc 3 ^ da c' db dc 

dV _ 2^, dV _ OR _ I OR _ b 

and the double derivatives 

d^R _2{a-V -bR) d^R_ 1 d^R _ R 

2 ' 



(B.4) 



d'^c ' dadc ' dbdc c 

d^V ^ ^ . d^V d^R 1 d^R b 

1 - , „„„ = 1 



(B.5) 



dVdc ' 9i?ac ' c' 9i?9c 



,2 ■ 



Lotka-Volterra model. Let (a;,?/) label the prey and predator respectively with parameters (a,/3,7, 
The non-zero single derivatives are 



dx 
da 


dx 
df3 


= -xy, 


dy 
d'-f 


= -y, 


dy 
dS 


= xy, 


dx 
dx 


Py, 


dx 
dy 




dy _ 
dx 


Sy, 


dy _ 
dy 


and the double derivatives are 














d'^x d'^y 


= 1, 


d'^x 




dH 


~y, 


d'^x 


dxda dydj 


dxdp 




dxdS 


dydp 



(B.6) 



-(7 - Sx), 



d'y ^ 
dydS 



d^x d^y 



(B.7) 



dxdy ' dxdy 



Appendix C. Extensions of mMALA 

Appendix C.l. Heteroscedasticity and bounded parameters 

In this section we present the implications on the calculation of the Fisher metric of incorporating 
additional structure on top of the normal noise model. In particular, we focus on heteroscedasticity and 
boundary constraints in observation noise. In the first case, in addition to a fixed normal observation 
noise, we have a noise contribution which scale scales with X; here the noise model is given by ( |23[ ) but 
with the replacement 

Erf := Iraj + diag{{lral){X.,d * (C.l) 



19 



In the second case, because Y often represents measurement of physical quantities (concentration, vol- 
ume, etc) there is usually a positive constraint on X and Y, the implication being that one has to 
consider truncated distributions. 



Using the expression of $ from ( 23 ) for the multivariate normal and a simple application of the chain 
rule, we have 



,1 



-didj^d = d,dj{-iYd - XdY (Yd - ^d)) 

= Sl,^-'Sd,, - {Y, - Xaf{^-\d.Sa,,) + + (C.2) 

+ \{Yd-Xaf{d.d,^-^'){Ya-Xd). 

For a normal distribution truncated between limits a < X < the expectation and variance can be 
written in terms of the untruncated mean and s.d. (/i, a) as |24| 



E{X\a < X <b)= n 



Var{X\a < X < b) = 



rb--fj.\ 



, — — O", 



1 



-)- 



(C.3) 
(C.4) 



We evaluate the Fisher metric as [ga{0]ij = E^{didj^ — hij), where hij is given by (28)~(30|. With- 
out loss of generality, we set h^j = and the Fisher metrics iga)ij a.nd their first and second derivatives 
w.r.t. 5, evaluated using ([s]), can be written for the noise model (23 1 with both the heteroscdasticity 
and positivity constraint as 



where 



D 

+ {dkdiE-YSd,j + {d,^d'idkSd,k) + {^k^,J:-')Sd,^ + {djj:^'){dkSd,i)] 
+ \ tr[idkd,d.j:^YJd) + {d,d,^dYddJd)), 

LdAj ^d^i'^'i-^dj) + {di'S^^)Sdj + {djT,^^)Sd,i, 
Jd := Kd * (ip - ttd * A(arf)). 



(C.5) 



(C.6) 



(C.7) 
(C.8) 



Specifying the fixed and heteroscedastic noise components a and ct, the three terms Kd^ \{cy.d) and ad 
are defined by 



Ed = diag{Kd), ad := Xd * 



[A(ad)] 



_ <l>{ad,i) 



(C.9) 



with Ed defined in (C.l), Xd = X^^, i.e. the c?*^ column of X, and 0(-) and <&(•) the density and 
cumulative distribution functions respectively of the standard normaj^ 

Removing the positivity constraints and setting the heteroscedastic variance term to zero, i.e. a — > 



^Not to be confused with the tempering parameter <j>a or the exponent <& in the likelihood in \27\. 
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-oo and dd = 0, we recover the expressions of the Fisher metric and its derivatives in the main text ( 32 ) 



and ( 33 1 , respectively. 



Appendix C.2. Higher-order Ozaki discretisation of the Langevin diffusion 

The Ozaki discretisation was formulated by Ozaki and Shoji [ini [23 and proposed for use in MALA 
framework by Stramer and Tweedie [55]. In contrast to the Euler discretisation, it is more stable 
|27j . providing a higher-order approximation for the drift term in the Langevin diffusion SDE. On the 
downside, its implementation can often be computationally expensive. 

The Ozaki discretisation of the Langevin diffusion SDE can be expressed as an MVN proposal, like 



the Eulcr discretisation in (11), as 

QaiL+i I L)-^{t^''{L,e),EO{^a,e)), (C.IO) 

where the components of the sequence (as indexed by a) of means /x^(^a,e) and covariance matrices 
are given by 

{^^^i^a,e)y = + {j-\expie'Ja)-lD)yV, (C.ll) 

{i:^{^a,e)f = i/(j,-i(exp(2eVa) -I,,))^', (C.12) 
where the drift term ba{Ca) is written as 

baiU = la^'dji - 9'\d,gki)9'' + ^.9'^/'5,5fe,, (C.13) 
with the Jacobian {^Ja{S,a)Y ^ — The explicit form of Jai£,a) is given by 

JaiU = -\{9'''{d,gu)9'"'d^i+^d,dke + g"'(.d,gu)g'"'idngmp)g''" 

- g'\d,dmm)g"'' + 5^'(5„<?fc,)5'"(a,3™p)5^" - ^5''(a,fffez)ff'™5"''(9™5p^) ^^-^^^ 
+ 5^' W"'{d,gmn)g''P{dkgpi) + g'"\d,d,gi^)] 



Expanding the first two terms of the exponentials in (C.ll) and (C.12), we have 



(M?(^a, « C + e% + '-^{JabaT, (C.15) 

(E?(ea,6))'-'' + 2eV,)/. (C.16) 

Keeping only the O(e^) terms of the expansion, we recover the Euler discretisation proposal parameters 
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